PART-1 (Project based)

DOMAIN: Automobile

• CONTEXT: The data concerns city-cycle fuel consumption in miles per gallon, to be predicted in terms of 3 multivalued discrete and 5 continuous attributes. The goal is to cluster the data, treat each cluster as an individual dataset, and train regression models to predict 'mpg'.

1. Import and warehouse data

Import all the given datasets and explore shape and size

Export the final dataset and store it on local machine in .csv, .xlsx and .json format for future use

Import the data from above steps into python.
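The export/import round trip can be sketched with pandas. The tiny DataFrame and file names below are placeholders, not the actual auto-mpg data:

```python
import pandas as pd

# Small stand-in for the auto-mpg data (column names are assumptions).
df = pd.DataFrame({
    "mpg": [18.0, 15.0, 36.1],
    "cyl": [8, 8, 4],
    "hp": [130, 165, 66],
})

# Export the final dataset in the requested formats for future use.
df.to_csv("auto_mpg.csv", index=False)
df.to_json("auto_mpg.json", orient="records")
# .xlsx export additionally needs an engine such as openpyxl installed:
# df.to_excel("auto_mpg.xlsx", index=False)

# Re-import and confirm shape and size.
df_csv = pd.read_csv("auto_mpg.csv")
print(df_csv.shape)  # (3, 3)
```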

2. Data cleansing

Missing/incorrect value treatment

No missing values were found in the dataset.

Drop attribute/s if required using relevant functional knowledge

3. Data analysis & visualisation

Perform detailed statistical analysis on the data.

'mpg' is almost normally distributed. 'cyl' shows a smooth distribution with no outliers. 'disp' may be slightly skewed, with a chance of outliers being present. 'acc' is normally distributed with no outliers. 'yr' is also normally distributed.

Perform a detailed univariate, bivariate and multivariate analysis with appropriate detailed comments after each analysis

Univariate analysis

There is one outlier lying above the upper whisker for 'mpg'.

The data distribution is not normal; no outliers can be seen.

The data distribution is not normal; no outliers can be seen.

The distribution is not normal, and 11 outliers can be seen above the upper whisker.

There are no outliers; the distribution is skewed to the right.

The data is normally distributed, with 9 outliers beyond the whiskers on both sides.

'yr' ranges from 70 to 80, and more than one mode can be seen. The 'origin' distribution is highest for 1, followed by 2 and 3.

Since 'mpg', 'hp' and 'acc' have outliers, let us take a logarithmic transform of 'hp', 'mpg' and 'acc' to reduce the influence of those outliers.
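A minimal sketch of the log transform, using np.log1p on toy 'hp'-like values (the numbers below are illustrative, not drawn from the dataset):

```python
import numpy as np
import pandas as pd

# Toy values standing in for 'hp'; the real column comes from the auto-mpg data.
hp = pd.Series([46, 95, 130, 165, 230])

# np.log1p computes log(1 + x); it compresses the right tail, pulling
# large values such as 230 much closer to the bulk of the distribution.
hp_log = np.log1p(hp)
print(hp_log.round(2).tolist())
```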

Bi-variate analysis

Good 'mpg' can be expected for 4-cylinder vehicles; 'mpg' drops as the number of cylinders increases.

As 'wt' increases, the 'mpg' of the vehicle decreases.

Multivariate analysis

From the above it can be seen that 'mpg' decreases with increases in 'cyl', 'hp' and 'wt'. 'disp' and 'hp' have a strong direct relationship.

'mpg' is negatively correlated with 'disp', 'hp' and 'wt'. 'cyl' is positively correlated with 'disp', 'hp' and 'wt'. 'hp' is positively correlated with 'cyl', 'disp' and 'wt'.

4. Machine learning

Use K Means and Hierarchical clustering to find out the optimal number of clusters in the data

K Means

Creating clusters

The bend occurs between 3 and 4; hence, the optimal number of clusters for the current dataset can be chosen as 4.
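The elbow method described here can be sketched as follows (synthetic blob data stands in for the scaled auto-mpg features):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic 4-cluster data stands in for the scaled auto-mpg features.
X, _ = make_blobs(n_samples=200, centers=4, random_state=42)

# Within-group sum of squared errors (inertia) for k = 1..10; the
# "elbow" is where the curve's decrease flattens out.
wss = []
for k in range(1, 11):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    wss.append(km.inertia_)

# Inertia always decreases with k; past the elbow the drop is small.
print([round(w) for w in wss])
```

A plot of `range(1, 11)` against `wss` gives the elbow curve used above.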

Grouped the data into the different clusters for K-Means clustering.

In terms of centroid distance, group 1 has the highest value for 'mpg' and group 2 the lowest; 'cyl', 'disp', 'hp' and 'wt' are all highest for group 2 and lowest for group 1; 'acc' is highest for group 1 and lowest for group 2.
In terms of mean value, 'mpg' is highest for group 1, followed by groups 3, 0 and 2; group 2 has the highest mean for 'cyl', 'hp' and 'wt'; group 3 has the highest mean for 'disp'; group 1 has the highest mean for 'acc'.

Hierarchical

The dendrogram has a large number of leaves. Let us cut it to obtain 2 clusters.
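Cutting the tree at a level that yields 2 clusters can be sketched with SciPy (toy blob data stands in for the scaled auto-mpg features):

```python
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.datasets import make_blobs

# Toy 2-group data; the real notebook would use the scaled auto-mpg features.
X, _ = make_blobs(n_samples=60, centers=2, random_state=0)

# Ward linkage builds the tree bottom-up; asking fcluster for at most
# 2 clusters mirrors cutting the dendrogram where a horizontal line
# crosses the longest vertical.
Z = linkage(X, method="ward")
labels = fcluster(Z, t=2, criterion="maxclust")
print(sorted(set(labels)))  # [1, 2]
```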

Group 2 has the highest mean value for 'mpg', while group 1 has a comparatively lower one. Group 1 has the highest mean for 'cyl', 'disp' and 'wt'; group 2 has the highest mean for 'hp' and 'acc'.

Share your insights about the difference in using these two methods

For K-Means clustering, the number of clusters must be fixed before the algorithm runs. Hence a method such as the elbow method is used to decide the optimal number, by plotting the number of clusters against the within-group sum of squared errors; the elbow occurred at 4 clusters, so the data was grouped into 4 clusters. In terms of centroid distance, group 1 has the highest value for 'mpg' and group 2 the lowest; 'cyl', 'disp', 'hp' and 'wt' are highest for group 2 and lowest for group 1, while 'acc' is highest for group 1 and lowest for group 2. In terms of mean value, 'mpg' is highest for group 1, followed by groups 3, 0 and 2; group 2 has the highest mean for 'cyl', 'hp' and 'wt', group 3 for 'disp', and group 1 for 'acc'.
Hierarchical (agglomerative) clustering, by contrast, uses a bottom-up approach, and the optimal number of clusters is decided by inspecting the dendrogram (where a horizontal line cuts the longest vertical), which becomes cumbersome as the number of objects grows. In scikit-learn, the default number of clusters for agglomerative clustering is 2 (when distance_threshold is not set). For the current dataset, 2 clusters were chosen from the dendrogram. Based on the mean values for the hierarchical clusters, group 2 has the highest mean for 'mpg', 'hp' and 'acc', while group 1 has the highest mean for 'cyl', 'disp' and 'wt'.

5. Answer below questions based on outcomes of using ML based methods

Mention how many optimal clusters are present in the data and what could be the possible reason behind it

For K-Means, a range of cluster counts from 1 to 10 was tried in a loop, and the optimal number was decided by the elbow method, i.e. by plotting the within-group sum of squared errors against the number of clusters; the elbow occurred at 4 clusters, so grouping was continued with 4 clusters. For hierarchical clustering, the optimal number of clusters was decided by inspecting the dendrogram (where a horizontal line cuts the longest vertical), which becomes cumbersome as the number of objects grows. In scikit-learn, the default number of clusters for agglomerative clustering is 2 (when distance_threshold is not set), and 2 clusters were chosen for the current dataset from the dendrogram.

Use linear regression model on different clusters separately and print the coefficients of the models individually

Linear regression model for original dataset

Linear Regression model for Kmeans clustering


Since there are 4 clusters, we will fit the linear regression model separately for each cluster.
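Fitting one model per cluster and printing the coefficients individually might look like the sketch below; the synthetic data and cluster labels stand in for the real features and the K-Means output:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic stand-in: two features, a 'cluster' label, and an 'mpg' target.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(200, 2)), columns=["wt", "hp"])
df["cluster"] = rng.integers(0, 4, size=200)
df["mpg"] = 30 - 4 * df["wt"] - 2 * df["hp"] + rng.normal(scale=0.1, size=200)

# Fit one linear regression per cluster and print its coefficients.
for c in sorted(df["cluster"].unique()):
    sub = df[df["cluster"] == c]
    lr = LinearRegression().fit(sub[["wt", "hp"]], sub["mpg"])
    print(f"cluster {c}: coef={lr.coef_.round(2)}, intercept={lr.intercept_:.2f}")
```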

For cluster 1 there is a large gap between training and testing accuracy.

For the third cluster, a large gap between training and testing accuracy can also be seen.

Linear Regression model for Hierarchial clustering

Since two clusters were formed in hierarchical clustering, we will fit the model, predict the result, and print the coefficients separately for the two clusters, i.e. cluster 1 and cluster 2.

How using different models for different clusters will be helpful in this case and how it will be different than using one single model without clustering? Mention how it impacts performance and prediction.

When handling a large dataset whose observations are not labeled, labeling is a significant undertaking that typically requires individuals with domain knowledge; even simple tasks, such as labeling image or video data, can take thousands of hours. The unsupervised technique of cluster analysis can therefore aid in providing labels for observed data. Common clustering algorithms include K-Means, Gaussian mixture models and hierarchical clustering, all of which use a distance measure to assess the similarity of observations. Clustering groups a set of objects so that objects in the same group are more similar to each other than to those in other groups; it is a main task of exploratory data analysis and a common technique for statistical data analysis, used in many fields including pattern recognition, image analysis, information retrieval, bioinformatics, data compression, computer graphics and machine learning. In this case, fitting a separate regression model per cluster lets each model capture the locally linear relationship within its group, which a single global model would average away; the trade-off is that each model sees less data, which can widen the gap between training and testing accuracy, as observed for clusters 1 and 3.

6. Improvisation

Detailed suggestions or improvements on quality, quantity, variety, velocity, veracity, etc. of the data points collected by the company, to enable better data analysis in future

The main aim of this dataset is to segregate the data points into different groups (clustering) and then apply a regression model to predict the mileage of the vehicle.

1. Some outliers were found in 'mpg', 'cyl' and 'acc', and these have been taken care of.

2. The '?' symbol was found in the 'hp' attribute and has been handled; such placeholder symbols could be avoided at collection time.

3. Some attributes, such as 'cyl' and 'disp', have little impact on the prediction; variables strongly correlated with the target improve model performance.

4. The data distribution was not normal and outliers were present for some attributes; cleaner collection could avoid this.

PART-II (DOMAIN: Manufacturing)- Wine quality prediction

Design a synthetic data generation model which can impute values [Attribute: Quality] wherever empty, i.e. where the company has missed recording the data.

There are 18 null values in the dataset.

The distribution is not normal, and in terms of scaling the values do not deviate too much.

Comparing the original target variable with the newly formed clusters' values, it was found that they match exactly. We can successfully replace the missing entries in the old target variable with the values from the new Cluster column.

Missing values (null values) have been replaced with the corresponding values from the cluster column.
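One way to sketch this imputation (using hypothetical column names, not the real wine schema): cluster on the feature columns, then fill each missing 'quality' with the most common recorded value inside its cluster.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Toy wine-like data; column names are assumptions, not the real schema.
rng = np.random.default_rng(1)
df = pd.DataFrame(rng.normal(size=(100, 3)), columns=["acidity", "sugar", "ph"])
df["quality"] = (df["acidity"] > 0).astype(float)  # stand-in target
df.loc[::10, "quality"] = np.nan                   # simulate missed recordings

# Cluster on the features, then map each cluster to the most common
# recorded quality inside it and fill the gaps with that value.
km = KMeans(n_clusters=2, n_init=10, random_state=1).fit(df[["acidity", "sugar", "ph"]])
df["cluster"] = km.labels_
fill = df.groupby("cluster")["quality"].agg(lambda s: s.mode().iloc[0])
df["quality"] = df["quality"].fillna(df["cluster"].map(fill))
print(df["quality"].isna().sum())  # 0
```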

PART-III (DOMAIN: Automobile)- classifying given silhouette as one of three types of vehicle

All the attributes are numeric except 'class'.

All the missing values are imputed with median values.

Distribution is found normal for most of the attributes.

From the density plot it is observed that the distribution is not normal and more than one mode can be seen for the majority of the attributes; outliers may be present in the dataset.

The distribution of vehicle classes follows the order: car at about 50.71%, followed by bus at 25.77% and van at 23.52%.

'radius ratio','pr.axis_aspect_ratio', 'max.length_aspect_ratio','scaled_variance', 'scaled_variance1','scaled_radius_of_gyration1','skewness_about','skewness_about1' attributes have outliers.

There is a significant difference between classes when comparing the mean and median across all the numeric attributes.

All the outliers have been taken care of by imputing with the median value.

Classifier: Design and train a best-fit SVM classifier using all the data attributes.
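A minimal SVM sketch (the iris dataset stands in for the vehicle-silhouette data; the kernel and C shown are illustrative defaults, not tuned best-fit values):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Iris stands in for the vehicle data: 3 classes, all-numeric attributes.
X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, random_state=7, stratify=y
)

# Scaling matters for SVMs: the RBF kernel is distance-based.
clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0, gamma="scale"))
clf.fit(X_tr, y_tr)
print(round(clf.score(X_te, y_te), 2))
```

A grid search over C and gamma would be the usual route to the actual best-fit model.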

Dimensional reduction(PCA)

About 90% of the variance is captured by the first four components, and the first 6 components capture about 95% of the variance. We can drop the components beyond 6; for PCA on the current dataset, 6 components are a good choice.
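Choosing the number of components from the cumulative explained variance can be sketched as follows (iris stands in for the vehicle data, so the resulting count differs from the 6 reported above):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Standardize first: PCA is sensitive to feature scale.
X, _ = load_iris(return_X_y=True)
X_std = StandardScaler().fit_transform(X)

# Cumulative explained variance tells us how many components to keep.
pca = PCA().fit(X_std)
cumvar = np.cumsum(pca.explained_variance_ratio_)
n_keep = int(np.argmax(cumvar >= 0.95)) + 1  # first index reaching 95%
print(n_keep, cumvar.round(3))
```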

Classifier: Design and train a best-fit SVM classifier using the dimensionally reduced attributes.

Both models give more than 90% accuracy on the test data. The PCA model used only 6 components to reach 90%+ accuracy, whereas the model without PCA used all the variables to reach the same level. The difference would be illustrated even better if the dataset suffered from the curse of dimensionality; since there are only 18 variables in the original data, the difference is very subtle.

PART-IV (DOMAIN: Sports Management)- Goal is to build a data driven batsman ranking model for the sports management company to make business decisions

EDA and visualisation

It is found that 50% of the dataset has null values. Since each attribute records individual performance, null values can't be replaced by central values; hence, the null values are dropped.

It seems quite a few outliers are present.

All the variable pairs have high correlation, except fours with strike rate, strike rate with half-centuries, and strike rate with runs.

Build a data driven model to rank all the players in the dataset using all or the most important performance features.

About 92% of the variance is captured by the first two components, so it is good to consider two principal components for further analysis.

Create a covariance matrix for identifying Principal components

Scree plot to visualize the percentage of variance explained by each principal component.
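The covariance-matrix route to principal components can be sketched directly with NumPy (random data stands in for the batsman performance features):

```python
import numpy as np

# Manual PCA via the covariance matrix: the eigenvalues give the
# variance captured by each component (what a scree plot shows).
rng = np.random.default_rng(3)
X = rng.normal(size=(100, 4))
X = X - X.mean(axis=0)  # center the data first

cov = np.cov(X, rowvar=False)          # 4x4 covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov) # eigh returns ascending eigenvalues
eigvals = eigvals[::-1]                # sort descending for a scree plot

# Percentage of variance explained by each principal component.
explained = eigvals / eigvals.sum()
print(explained.round(3))
```

Plotting `explained` against component index reproduces the scree plot described above.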

PART-V Question based

1. List down all possible dimensionality reduction techniques that can be implemented using python

Dimensionality reduction techniques can be classified into two broad categories: I. Feature selection and II. Dimensionality reduction.

I. Feature selection

a. Missing value ratio: if the dataset has too many missing values, we can drop the variables having a large number of missing values.

b. Low variance filter: identify and drop constant (or near-constant) variables from the dataset; variables with low variance do not unduly affect the target and can be safely dropped.

c. High correlation filter: a pair of variables with high correlation increases multicollinearity in the dataset, so we can find highly correlated features and drop one of each pair.

d. Random forest: one of the most commonly used techniques, it tells us the importance of each feature present in the dataset; we can keep the top features, resulting in dimensionality reduction.

e. Backward feature elimination and forward feature selection: both take a lot of computational time and are thus generally used on smaller datasets.

II. Dimensionality reduction

1. Component/factor based

a. Factor analysis: best suited for situations where we have a highly correlated set of variables; it divides the variables into groups based on their correlation and represents each group with a factor.

b. Principal component analysis (PCA): one of the most widely used techniques for dealing with linear data; it divides the data into a set of components that try to explain as much variance as possible.

c. Independent component analysis (ICA): transforms the data into independent components which describe the data using a smaller number of components.

2. Projection based

a. ISOMAP: used when the data is strongly non-linear.

b. t-distributed stochastic neighbor embedding (t-SNE): also works well when the data is strongly non-linear, and works extremely well for visualizations.

c. UMAP: works well for high-dimensional data, with a shorter run-time than t-SNE.
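Two of the filter techniques above (low variance and high correlation) can be sketched on toy data:

```python
import numpy as np
import pandas as pd

# Toy data: 'b' is constant, 'c' is a near-duplicate of 'a'.
rng = np.random.default_rng(5)
df = pd.DataFrame({
    "a": rng.normal(size=50),
    "b": np.zeros(50),
})
df["c"] = df["a"] * 2 + 0.01 * rng.normal(size=50)

# Low variance filter: drop (near-)constant columns.
keep = [col for col in df.columns if df[col].var() > 1e-8]
df = df[keep]

# High correlation filter: drop one of each pair correlated above 0.95.
corr = df.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col] > 0.95).any()]
print(sorted(to_drop))  # ['c']
```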

2. So far you have used dimensionality reduction on numeric data. Is it possible to do the same on multimedia data [images and video] and text data? Please illustrate your findings using a simple implementation in python.

Yes, it is possible to use dimensionality reduction techniques on multimedia data as well, which will compress the data. The main challenge, however, is compressing without compromising quality too much. For images, we can think of the number of features as the number of pixels, so a 64x64 image has 4096 features! One way to reduce that number (and hopefully produce a more accurate model) is to effectively compress the image: we try to keep as much information as possible about the image without losing its essential structure. Let us illustrate this with a simple example (courtesy: Kaggle).

Import images into python using PIL or any other python image library

Display the image - actual

Display the image - matrix

There are 2062 images, each a 64x64-dimensional vector.

The Y dataset gives us the labels for these images; it is somewhat oddly ordered, and this image represents the number.

Supervised learning algorithms require an MxN dataframe to classify, whereas here a single image is itself MxN, making it difficult for the algorithm to take in the data directly.

Apply any other algorithm of your choice. Note the accuracy (A1)

KNN

MLP

As we can see, this is a pretty poor model, only achieving ~30% overall accuracy on the test set. We are now going to reduce the dimension of our training data and then retrain. The objective is to reduce the number of dimensions of the image, but first we need to decide what to reduce it to: we will find the number of dimensions that keeps 95% of the variance of the original images.
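Finding the component count that keeps 95% of the variance is straightforward in scikit-learn, since a float n_components asks PCA for exactly that (the 8x8 digits dataset stands in here for the 64x64 sign-language images, so the resulting count differs from the one reported below):

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

# 8x8 digit images (64 pixels) stand in for the 64x64 images in the text.
X, _ = load_digits(return_X_y=True)

# A float n_components asks PCA for the smallest number of components
# that keeps at least that fraction of the variance.
pca = PCA(n_components=0.95).fit(X)
X_reduced = pca.transform(X)
print(X.shape[1], "->", X_reduced.shape[1])
```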

We have reduced 4096 dimensions to just 292! But how good is this actually? Let's fit PCA on our training set, transform the data, and then print out an example.

We can see it's far from perfect, but it's still clear what shape the hand is making. Let's retrain our model with the dimensionally reduced training data:

KNN

Model accuracy on the test set improved from 42.32% to 43.13% for the KNN algorithm.

MLP

Model accuracy on the test set improved from ~30% to ~65% for the MLP algorithm after dimensionality reduction using PCA.